LatentKeypointGAN: Controlling GANs via Latent Keypoints
Generative adversarial networks (GANs) have attained photo-realistic quality
in image generation. However, how to best control the image content remains an
open challenge. We introduce LatentKeypointGAN, a two-stage GAN which is
trained end-to-end on the classical GAN objective with internal conditioning on
a set of spatial keypoints. These keypoints have associated appearance embeddings
that respectively control the position and style of the generated objects and
their parts. A major difficulty that we address with suitable network
architectures and training schemes is disentangling the image into spatial and
appearance factors without domain knowledge and supervision signals. We
demonstrate that LatentKeypointGAN provides an interpretable latent space that
can be used to re-arrange the generated images by re-positioning and exchanging
keypoint embeddings, such as generating portraits by combining the eyes, nose,
and mouth from different images. In addition, the explicit generation of
keypoints and matching images enables a new, GAN-based method for unsupervised
keypoint detection.
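To make the keypoint conditioning concrete, here is a minimal sketch of how per-keypoint appearance embeddings could be splatted into a spatial conditioning map for a generator. The function name, tensor layouts, and Gaussian splatting are illustrative assumptions, not the paper's implementation.

```python
import torch

def keypoint_feature_map(keypoints, embeddings, size=64, sigma=0.1):
    """Splat per-keypoint appearance embeddings into a spatial map.

    keypoints:  (B, K, 2) keypoint positions in [-1, 1] image coordinates
    embeddings: (B, K, C) appearance codes, one per keypoint (assumed layout)
    returns:    (B, C, size, size) conditioning map for an image generator
    """
    coords = torch.linspace(-1, 1, size)
    grid_y, grid_x = torch.meshgrid(coords, coords, indexing="ij")
    grid = torch.stack([grid_x, grid_y], dim=-1)                  # (s, s, 2)
    # Squared distance of every pixel to every keypoint.
    d2 = ((grid[None, None] - keypoints[:, :, None, None]) ** 2).sum(-1)
    heat = torch.exp(-d2 / (2 * sigma ** 2))                      # (B, K, s, s)
    # Each pixel inherits the styles of nearby keypoints.
    return torch.einsum("bkhw,bkc->bchw", heat, embeddings)

# Moving a keypoint moves the corresponding part; swapping an embedding
# (e.g. the eyes' code from another portrait) swaps its appearance.
feat = keypoint_feature_map(torch.rand(2, 10, 2) * 2 - 1, torch.randn(2, 10, 32))
print(feat.shape)  # torch.Size([2, 32, 64, 64])
```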
GANSeg: Learning to Segment by Unsupervised Hierarchical Image Generation
Segmenting an image into its parts is a common preprocessing step for high-level
vision tasks such as image editing. However, annotating masks for supervised
training is expensive. Weakly-supervised and unsupervised methods exist, but
they depend on the comparison of pairs of images, such as from multi-views,
frames of videos, and image augmentation, which limits their applicability. To
address this, we propose a GAN-based approach that generates images conditioned
on latent masks, thereby alleviating full or weak annotations required in
previous approaches. We show that such mask-conditioned image generation can be
learned faithfully when conditioning the masks in a hierarchical manner on
latent keypoints that define the position of parts explicitly. Without
requiring supervision of masks or points, this strategy increases robustness to
viewpoint and object position changes. It also lets us generate image-mask
pairs for training a segmentation network, which outperforms the
state-of-the-art unsupervised segmentation methods on established benchmarks.
Comment: CVPR 2022
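As an illustration of the hierarchical conditioning, the sketch below turns latent keypoints into competing soft part masks; pixels far from every keypoint fall to a background class. The softmax formulation and all names are assumptions for exposition, not the paper's code.

```python
import math
import torch

def keypoints_to_soft_masks(keypoints, size=64, sigma=0.08, bg_level=0.1):
    """Turn latent keypoints into soft part masks (illustrative only).

    keypoints: (B, K, 2) part centers in [-1, 1]. The per-part heatmaps
    compete through a softmax, so every pixel is assigned to one part or
    to an extra background class.
    returns:   (B, K + 1, size, size) soft masks that sum to 1 per pixel.
    """
    B = keypoints.shape[0]
    coords = torch.linspace(-1, 1, size)
    gy, gx = torch.meshgrid(coords, coords, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1)                          # (s, s, 2)
    d2 = ((grid[None, None] - keypoints[:, :, None, None]) ** 2).sum(-1)
    logits = -d2 / (2 * sigma ** 2)                               # (B, K, s, s)
    bg = torch.full((B, 1, size, size), math.log(bg_level))      # constant class
    return torch.softmax(torch.cat([logits, bg], dim=1), dim=1)

masks = keypoints_to_soft_masks(torch.rand(1, 8, 2) * 2 - 1)
print(masks.shape)            # torch.Size([1, 9, 64, 64])
# Generated (image, argmax-mask) pairs can then supervise a segmentation net.
```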
Pose Modulated Avatars from Video
It is now possible to reconstruct dynamic human motion and shape from a
sparse set of cameras using Neural Radiance Fields (NeRF) driven by an
underlying skeleton. However, a challenge remains to model the deformation of
cloth and skin in relation to skeleton pose. Unlike existing avatar models that
are learned implicitly or rely on a proxy surface, our approach is motivated by
the observation that different poses necessitate unique frequency assignments.
Neglecting this distinction yields noisy artifacts in smooth areas or blurs
fine-grained texture and shape details in sharp regions. We develop a
two-branch neural network that is adaptive and explicit in the frequency
domain. The first branch is a graph neural network that models correlations
among body parts locally, taking skeleton pose as input. The second branch
combines these correlation features into a set of global frequencies and then
modulates the feature encoding. Our experiments demonstrate that our network
outperforms state-of-the-art methods in terms of preserving details and
generalization capabilities.
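The two-branch idea can be approximated in a few lines: a small pose network, standing in for the paper's graph-neural-network branch, predicts per-band gains that modulate a NeRF-style Fourier encoding of the query point. This is a hedged sketch under assumed layer sizes and gating form, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PoseModulatedEncoding(nn.Module):
    """Pose-adaptive frequency encoding (a sketch, not the paper's network).

    Per-band gains predicted from the skeleton pose scale a Fourier
    encoding of the query point x: high bands are kept where the pose
    implies sharp detail and suppressed where it implies smoothness.
    """
    def __init__(self, pose_dim, n_bands=8):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_bands))  # octaves
        self.gate = nn.Sequential(
            nn.Linear(pose_dim, 64), nn.ReLU(),
            nn.Linear(64, n_bands), nn.Sigmoid())                    # band gains

    def forward(self, x, pose):
        # x: (N, 3) query points, pose: (N, pose_dim) skeleton features
        gains = self.gate(pose)                                   # (N, n_bands)
        ang = x[..., None] * self.freqs                           # (N, 3, n_bands)
        enc = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1) # (N, 3, 2*n_bands)
        enc = enc * torch.cat([gains, gains], dim=-1)[:, None, :]
        return enc.flatten(1)                                     # (N, 6*n_bands)

enc = PoseModulatedEncoding(pose_dim=72)
print(enc(torch.randn(4, 3), torch.randn(4, 72)).shape)  # torch.Size([4, 48])
```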
Hinge-Wasserstein: Mitigating Overconfidence in Regression by Classification
Modern deep neural networks are prone to being overconfident despite their
drastically improved performance. In ambiguous or even unpredictable real-world
scenarios, this overconfidence can pose a major risk to the safety of
applications. For regression tasks, the regression-by-classification approach
has the potential to alleviate these ambiguities by instead predicting a
discrete probability density over the desired output. However, a density
estimator still tends to be overconfident when trained with the common NLL
loss. To mitigate the overconfidence problem, we propose a loss function,
hinge-Wasserstein, based on the Wasserstein distance. This loss significantly
improves the quality of both aleatoric and epistemic uncertainty, compared to
previous work. We demonstrate the capabilities of the new loss on a synthetic
dataset, where both types of uncertainty are controlled separately. Moreover,
as a demonstration for real-world scenarios, we evaluate our approach on the
benchmark dataset Horizon Lines in the Wild. On this benchmark, using the
hinge-Wasserstein loss reduces the Area Under Sparsification Error (AUSE) for
the horizon parameters slope and offset by 30.47% and 65.00%, respectively.
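For intuition, a hinged 1D Wasserstein loss over discrete densities can be written compactly, since the Wasserstein-1 distance between two distributions on a shared bin grid is the L1 distance between their CDFs. The margin value and reduction below are assumptions; consult the paper for the exact formulation.

```python
import torch

def hinge_wasserstein_1d(pred_probs, target_probs, margin=0.05):
    """Hinged Wasserstein-1 loss over discrete 1D densities (a sketch).

    On a shared bin grid, W1 between two 1D distributions is the L1
    distance between their CDFs (up to the bin width). The hinge zeroes
    out distances below `margin`, so the prediction is not pushed all the
    way to an overconfident delta at the target bin.
    """
    cdf_pred = torch.cumsum(pred_probs, dim=-1)      # (B, n_bins)
    cdf_tgt = torch.cumsum(target_probs, dim=-1)
    w1 = (cdf_pred - cdf_tgt).abs().mean(dim=-1)     # per-sample W1 proxy
    return torch.clamp(w1 - margin, min=0).mean()    # hinge, then batch mean

logits = torch.randn(8, 100, requires_grad=True)     # 100 output bins
target = torch.softmax(torch.randn(8, 100), dim=-1)  # stand-in soft labels
loss = hinge_wasserstein_1d(torch.softmax(logits, dim=-1), target)
loss.backward()
```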
Mirror-Aware Neural Humans
Human motion capture either requires multi-camera systems or is unreliable
using single-view input due to depth ambiguities. Meanwhile, mirrors are
readily available in urban environments and form an affordable alternative by
recording two views with only a single camera. However, the mirror setting
poses the additional challenge of handling occlusions between the real and mirrored image.
Going beyond existing mirror approaches for 3D human pose estimation, we
utilize mirrors for learning a complete body model, including shape and dense
appearance. Our main contributions are extending articulated neural radiance
fields to include a notion of a mirror and making them sample-efficient over
potential occlusion regions. Together, our contributions realize a
consumer-level 3D motion capture system that starts from off-the-shelf 2D poses
by automatically calibrating the camera, estimating mirror orientation, and
subsequently lifting 2D keypoint detections to 3D skeleton pose that is used to
condition the mirror-aware NeRF. We empirically demonstrate the benefit of
learning a body model and accounting for occlusion in challenging mirror
scenes.
Comment: Project website: https://danielajisafe.github.io/mirror-aware-neural-humans
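The core geometric ingredient is a reflection across the mirror plane: the virtual person seen in the mirror is the real person's reflection, so one camera effectively observes two views. Below is a minimal sketch of that reflection, with an illustrative plane parameterization rather than the paper's calibration pipeline.

```python
import numpy as np

def reflect_points(points, plane_normal, plane_point):
    """Reflect 3D points across a mirror plane (illustrative geometry).

    points: (N, 3); plane_normal: (3,) plane normal; plane_point: (3,) any
    point on the plane. The mirrored body is the real body reflected
    across this plane, which is what lets a single camera record two views.
    """
    n = plane_normal / np.linalg.norm(plane_normal)
    d = (points - plane_point) @ n          # signed distance to the plane
    return points - 2.0 * d[:, None] * n    # move twice the distance across

# A joint at z = 2 reflected in a mirror plane at z = 3:
joints = np.array([[0.3, 1.2, 2.0]])
print(reflect_points(joints, np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 3.0])))
# -> [[0.3 1.2 4.0]]: the virtual joint behind the mirror
```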
AudioViewer: Learning to Visualize Sounds
A long-standing goal in the field of sensory substitution is to enable sound
perception for deaf and hard of hearing (DHH) people by visualizing audio
content. Different from existing models that translate to hand sign language,
between speech and text, or text and images, we target immediate and low-level
audio-to-video translation that applies to generic environment sounds as well
as human speech. Since such a substitution is artificial, without labels for
supervised learning, our core contribution is to build a mapping from audio to
video that learns from unpaired examples via high-level constraints. For
speech, we additionally disentangle content from style, such as gender and
dialect. Qualitative and quantitative results, including a human study,
demonstrate that our unpaired translation approach maintains important audio
features in the generated video and that videos of faces and numbers are well
suited for visualizing high-dimensional audio features that can be parsed by
humans to match and distinguish between sounds and words. Code and models are
available at https://chunjinsong.github.io/audioviewer
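One way to read "learning from unpaired examples via high-level constraints" is as a structure-preserving bridge between pretrained audio and video latent spaces. The sketch below shows one such constraint, preserving pairwise distances; the linear mapper, dimensionalities, and loss are assumptions for exposition, not the paper's objective.

```python
import torch
import torch.nn as nn

# A linear bridge from audio codes to video codes; dimensions are assumptions.
audio_to_video = nn.Linear(32, 32)

def structure_loss(za, mapper):
    """Keep pairwise distances when mapping audio codes to video codes,
    so similar sounds land on similar (and thus decodable) video latents."""
    zv = mapper(za)
    da = torch.cdist(za, za)                # (B, B) audio-code distances
    dv = torch.cdist(zv, zv)                # (B, B) mapped-code distances
    return ((da - dv) ** 2).mean()

za = torch.randn(16, 32)    # stand-in for codes from a pretrained audio encoder
loss = structure_loss(za, audio_to_video)
loss.backward()
```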
GMSF: Global Matching Scene Flow
We tackle the task of scene flow estimation from point clouds. Given a source
and a target point cloud, the objective is to estimate a translation from each
point in the source point cloud to the target, resulting in a 3D motion vector
field. Previous dominant scene flow estimation methods require complicated
coarse-to-fine or recurrent architectures as a multi-stage refinement. In
contrast, we propose a significantly simpler single-scale one-shot global
matching to address the problem. Our key finding is that reliable feature
similarity between point pairs is essential and sufficient to estimate accurate
scene flow. To this end, we propose to decompose the feature extraction step
via a hybrid local-global-cross transformer architecture which is crucial to
accurate and robust feature representations. Extensive experiments show that
GMSF sets a new state-of-the-art on multiple scene flow estimation benchmarks.
On FlyingThings3D, with the presence of occlusion points, GMSF reduces the
outlier percentage from the previous best performance of 27.4% to 11.7%. On
KITTI Scene Flow, without any fine-tuning, our proposed method shows
state-of-the-art performance.
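The single-shot global matching idea reduces to soft correspondence by feature similarity: each source point attends over all target points, and its flow is the attention-weighted target position minus its own. The interface below is an assumed simplification of such a matcher, not the GMSF code.

```python
import torch

def global_matching_flow(feat_src, feat_tgt, xyz_src, xyz_tgt, tau=0.1):
    """Single-shot global matching (a simplified, assumed interface).

    feat_src: (N, C), feat_tgt: (M, C) point features from the feature
    extractor; xyz_src: (N, 3), xyz_tgt: (M, 3) point positions.
    Each source point softly matches all target points by similarity.
    """
    sim = feat_src @ feat_tgt.T / tau       # (N, M) similarity logits
    attn = torch.softmax(sim, dim=-1)       # soft correspondence weights
    matched = attn @ xyz_tgt                # expected matched location
    return matched - xyz_src                # (N, 3) motion vectors

flow = global_matching_flow(
    torch.randn(1024, 64), torch.randn(2048, 64),
    torch.rand(1024, 3), torch.rand(2048, 3))
print(flow.shape)  # torch.Size([1024, 3])
```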